Abstract

In this paper, we present a unigram segmentation model for statistical machine translation where the segmentation units are blocks: pairs of phrases without internal structure. The segmentation model uses a novel orientation component to handle swapping of neighbor blocks. During training, we collect block unigram counts with orientation: we count how often a block occurs to the left or to the right of some predecessor block. The orientation model is shown to improve translation performance over two models: 1) a model in which no block re-ordering is used, and 2) a model in which the block swapping is controlled only by a language model. We show experimental results on a standard Arabic-English translation task.

1  Introduction 

In recent years, phrase-based systems for statistical machine translation (Och et al., 1999; Koehn et al., 2003; Venugopal et al., 2003) have delivered state-of-the-art performance on standard translation tasks. In this paper, we present a phrase-based unigram system similar to the one in (Tillmann and Xia, 2003), which is extended by a unigram orientation model. The units of translation are blocks, pairs of phrases without internal structure. Fig. 1 shows an example block translation using five Arabic-English blocks $b_1, \ldots, b_5$. The unigram orientation model is trained from word-aligned training data. During decoding, we view translation as a block segmentation process, where the input sentence is segmented from left to right and the target sentence is generated from bottom to top, one block at a time. A monotone block sequence is generated except for the possibility to swap a pair of neighbor blocks. The novel orientation model is used to assist the block swapping: as shown in Section 3, block swapping where only a trigram language model is used to compute probabilities between neighbor blocks fails to improve translation performance. (Wu, 1996; Zens and Ney, 2003) present re-ordering models that make use of a straight/inverted orientation model that is related to our work. Here, we investigate in detail the effect of restricting the word re-ordering to neighbor block swapping only.

Figure 1: An Arabic-English block translation example taken from the devtest set. The Arabic words are romanized.

In this paper, we assume a block generation process that generates block sequences from bottom to top, one block at a time. The score of a successor block $b$ depends on its predecessor block $b'$ and on its orientation relative to the block $b'$. In Fig. 1 for example, block $b_1$ is the predecessor of block $b_2$, and block $b_2$ is the predecessor of block $b_3$. The target clump of a predecessor block $b'$ is adjacent to the target clump of a successor block $b$. A right adjacent predecessor block $b'$ is a block where additionally the source clumps are adjacent and the source clump of $b'$ occurs to the right of the source clump of $b$. A left adjacent predecessor block is defined accordingly.

During decoding, we compute the score $P(b_1^n, o_1^n)$ of a block sequence $b_1^n$ with orientation $o_1^n$ as a product of block bigram scores:

$$P(b_1^n, o_1^n) \approx \prod_{i=1}^{n} p(b_i, o_i \mid b_{i-1}, o_{i-1}) \qquad (1)$$

where $b_i$ is a block and $o_i \in \{L, R, N\}$ is a three-valued orientation component linked to the block $b_i$ (the orientation $o_{i-1}$ of the predecessor block is ignored). A block $b_i$ has right orientation ($o_i = R$) if it has a left adjacent predecessor. Accordingly, a block $b_i$ has left orientation ($o_i = L$) if it has a right adjacent predecessor. If a block has neither a left nor a right adjacent predecessor, its orientation is neutral ($o_i = N$). The neutral orientation is not modeled explicitly in this paper; rather, it is handled as a default case as explained below. In Fig. 1, the orientation sequence is $o_1^5 = (N, L, N, N, L)$, i.e. block $b_2$ and block $b_5$ are generated using left orientation. During decoding most blocks have right orientation ($o = R$), since the block translations are mostly monotone.
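As an illustration, the orientation definition above can be written as a short Python sketch. The `Block` class, the half-open span convention, and the five-block layout are assumptions made for this example only; the layout is invented so that it reproduces the orientation sequence quoted for Fig. 1 and is not the actual alignment shown in the figure.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Block:
    """A block with half-open source and target spans (hypothetical layout)."""
    src_start: int
    src_end: int
    tgt_start: int
    tgt_end: int


def orientation(pred: Block, succ: Block) -> str:
    """Return 'R', 'L' or 'N' for the successor block given its predecessor."""
    # The predecessor's target clump must end where the successor's begins.
    if pred.tgt_end != succ.tgt_start:
        return "N"
    # Left adjacent predecessor: its source clump ends where succ's starts.
    if pred.src_end == succ.src_start:
        return "R"
    # Right adjacent predecessor: its source clump starts where succ's ends.
    if pred.src_start == succ.src_end:
        return "L"
    return "N"


def orientation_sequence(blocks: List[Block]) -> List[str]:
    """Orientations for blocks listed in target (bottom-to-top) order."""
    labels = ["N"] if blocks else []  # the first block has no predecessor
    for pred, succ in zip(blocks, blocks[1:]):
        labels.append(orientation(pred, succ))
    return labels


# Invented layout: source order is b2, b1, b3, b5, b4.
example = [
    Block(1, 2, 0, 1),  # b1
    Block(0, 1, 1, 2),  # b2, swapped with b1
    Block(2, 3, 2, 3),  # b3
    Block(4, 5, 3, 4),  # b4, generated after skipping b5's source position
    Block(3, 4, 4, 5),  # b5, swapped with b4
]
print(orientation_sequence(example))  # ['N', 'L', 'N', 'N', 'L']
```

In this sketch a block that follows its predecessor monotonically in both source and target receives the label `R`, which is the common case during decoding.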

We try to find a block sequence with orientation $(b_1^n, o_1^n)$ that maximizes $P(b_1^n, o_1^n)$. The following three types of parameters are used to model the block bigram score $p(b_i, o_i \mid b_{i-1}, o_{i-1})$ in Eq. 1:

• Two unigram count-based models: $p(b)$ and $p_b(o)$. We compute the unigram probability $p(b)$ of a block based on its occurrence count $N(b)$. The blocks are counted from word-aligned training data. We also collect unigram counts with orientation: a left count $N_L(b)$ and a right count $N_R(b)$. These counts are defined via an enumeration process and are used to define the orientation model $p_b(o)$:

$$p_b(o \in \{L, R\}) = \frac{N_o(b)}{N_L(b) + N_R(b)}.$$

• Trigram language model: The block language model score $p(b_i \mid b_{i-1})$ is computed as the probability of the first target word in the target clump of $b_i$ given the final two words of the target clump of $b_{i-1}$.

The  three  models  are  combined  in  a  log-linear  way,  as 
shown  in  the  following  section. 
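The orientation counts and the orientation model $p_b(o)$ described above might be computed along the following lines. This is a minimal sketch under stated assumptions: the tuple layout of an enumerated block occurrence, the half-open span convention, and the function names are all hypothetical, and the toy sentences are invented for illustration.

```python
from collections import Counter
from typing import Hashable, List, Optional, Tuple

# Hypothetical record: (block id, src_start, src_end, tgt_start, tgt_end),
# with half-open, non-empty spans.
Occurrence = Tuple[Hashable, int, int, int, int]


def collect_orientation_counts(
    sentences: List[List[Occurrence]],
) -> Tuple[Counter, Counter]:
    """Count left/right orientations over the enumerated blocks of each sentence pair."""
    n_left: Counter = Counter()
    n_right: Counter = Counter()
    for occurrences in sentences:
        for b_id, b_ss, b_se, b_ts, _ in occurrences:
            for _, p_ss, p_se, _, p_te in occurrences:
                if p_te != b_ts:       # target clumps are not adjacent
                    continue
                if p_ss == b_se:       # right adjacent predecessor
                    n_left[b_id] += 1
                elif p_se == b_ss:     # left adjacent predecessor
                    n_right[b_id] += 1
    return n_left, n_right


def orientation_prob(
    n_left: Counter, n_right: Counter, block_id: Hashable, o: str
) -> Optional[float]:
    """p_b(o) for o in {'L', 'R'}; None if the block was never seen with a predecessor."""
    total = n_left[block_id] + n_right[block_id]
    if total == 0:
        return None
    return (n_left if o == "L" else n_right)[block_id] / total


# Toy data: 'noun' follows a right adjacent 'adj' twice and a left adjacent 'det' once.
toy = [
    [("adj", 1, 2, 0, 1), ("noun", 0, 1, 1, 2)],
    [("det", 0, 1, 0, 1), ("noun", 1, 2, 1, 2)],
    [("adj", 1, 2, 0, 1), ("noun", 0, 1, 1, 2)],
]
n_left, n_right = collect_orientation_counts(toy)
print(n_left["noun"], n_right["noun"])
print(orientation_prob(n_left, n_right, "noun", "L"))
```

The identity of the predecessor is discarded in this sketch; only the successor block's counters are incremented.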


2 Orientation Unigram Model

The basic idea of the orientation model can be illustrated as follows: In the example translation in Fig. 1, block $b_2$ occurs to the left of block $b_1$. Although the joint block $(b_2, b_1)$ consisting of the two smaller blocks $b_1$ and $b_2$ has not been seen in the training data, we can still profit from the fact that block $b_2$ occurs more frequently with left than with right orientation. In our Arabic-English training data, block $b_2$ has been seen $N_L(b_2) = 52$ times with left orientation, and $N_R(b_2) = 0$ times with right orientation, i.e. it is always involved in swapping. This intuition is formalized using unigram counts with orientation. The orientation model is related to the distortion model in (Brown et al., 1993), but we do not compute a block alignment during training. We rather enumerate all relevant blocks in some order. Enumeration does not allow us to capture position dependent distortion probabilities, but we can compute statistics about adjacent block predecessors.

Our baseline model is the unigram monotone model described in (Tillmann and Xia, 2003). Here, we select blocks $b$ from word-aligned training data and unigram block occurrence counts $N(b)$ are computed: all blocks for a training sentence pair are enumerated in some order and we count how often a given block occurs in the parallel training data¹. The training algorithm yields a list of about 65 blocks per training sentence pair. In this paper, we make extended use of the baseline enumeration procedure: for each block $b$, we additionally enumerate all its left and right predecessors $b'$. No optimal block segmentation is needed to compute the predecessors: for each block $b$, we check for adjacent predecessor blocks $b'$ that also occur in the enumeration list. We compute left orientation counts $N_L(b)$ as follows:

$$N_L(b) = \sum_{s} \; \sum_{b' \text{ right adjacent predecessor of } b} 1.$$

Here, we enumerate all adjacent predecessors $b'$ of block $b$ over all training sentence pairs $s$. The identity of $b'$ is ignored. $N_L(b)$ is the number of times the block $b$ succeeds some right adjacent predecessor block $b'$. The 'right' orientation count $N_R(b)$ is defined accordingly. Note that in general the unigram count $N(b) \neq N_L(b) + N_R(b)$: during enumeration, a block $b$ might have both left and right adjacent predecessors, either a left or a right adjacent predecessor, or no adjacent predecessors at all. The orientation count collection is illustrated in Fig. 2: each time a block $b$ has a left or right adjacent predecessor in the parallel training data, the orientation counts are incremented accordingly.

Figure 2: During training, blocks are enumerated in some order: for each block $b$, we look for left and right adjacent predecessors $b'$.

¹ We keep all blocks for which $N(b) > 2$ and the phrase length is less than or equal to 8. No other selection criteria are applied. For the S1 & OR model, we keep all blocks for which $N(b) > 3$.

The decoding orientation restrictions are illustrated in Fig. 3: a monotone block sequence with right ($o_i = R$) orientation is generated. If a block is skipped, e.g. block $b_3$ in Fig. 3, by first generating block $b_2$ and then block $b_3$, the block $b_3$ is generated using left orientation $o_3 = L$. Since the block translation is generated from bottom to top, the blocks $b_2$ and $b_4$ do not have adjacent predecessors below them: they are generated by a default model $p(b_i \mid b_{i-1})$ without orientation component. The orientation model is given in Eq. 2, the default model is given in Eq. 3. The block bigram model $p(b_i, o_i \in \{L, R\} \mid b_{i-1}, o_{i-1})$ in Eq. 1 is defined as:

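$$p(b_i, o_i \in \{L, R\} \mid b_{i-1}, o_{i-1}) = p(b_i)^{\alpha_0} \cdot p_{b_i}(o_i)^{\alpha_1} \cdot p(b_i \mid b_{i-1})^{\alpha_2} \qquad (2)$$

where $\alpha_0 + \alpha_1 + \alpha_2 = 1.0$ and the orientation $o_{i-1}$ of the predecessor is ignored. The $\alpha_i$ are chosen to be optimal on the devtest set (the optimal parameter setting is shown in Table 1). Only two parameters have to be optimized due to the constraint that the $\alpha_i$ have to sum to 1.0. The default model $p(b_i, o_i = N \mid b_{i-1}, o_{i-1}) = p(b_i \mid b_{i-1})$ is defined as:

$$p(b_i \mid b_{i-1}) = p(b_i)^{\alpha'_0} \cdot p(b_i \mid b_{i-1})^{\alpha'_1} \qquad (3)$$

where $\alpha'_0 + \alpha'_1 = 1.0$. The $\alpha'_i$ are not optimized separately, rather we define: $\alpha'_0 = $ [...].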
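To make the log-linear combination concrete, the following Python sketch evaluates the orientation model and the default model in the log domain, so that the product over blocks in Eq. 1 becomes a sum. The function names are hypothetical, the assignment of the three weights to the unigram, orientation and language model terms (in that order) is an assumption of this sketch, and the primed weights of the default model are simply passed in explicitly. The probabilities in the usage lines are invented.

```python
import math
from typing import Iterable, Tuple


def _weighted_log(weight: float, prob: float) -> float:
    """weight * log(prob); a zero weight contributes nothing, a zero probability gives -inf."""
    if weight == 0.0:
        return 0.0
    return weight * math.log(prob) if prob > 0.0 else -math.inf


def oriented_log_score(
    p_block: float, p_orient: float, p_lm: float,
    alphas: Tuple[float, float, float],
) -> float:
    """Log of the three-factor score for a block with orientation L or R."""
    if not math.isclose(sum(alphas), 1.0):
        raise ValueError("weights must sum to 1.0")
    # Assumed order: (unigram weight, orientation weight, language model weight).
    a_uni, a_orient, a_lm = alphas
    return (_weighted_log(a_uni, p_block)
            + _weighted_log(a_orient, p_orient)
            + _weighted_log(a_lm, p_lm))


def default_log_score(
    p_block: float, p_lm: float, alphas_prime: Tuple[float, float],
) -> float:
    """Log of the two-factor default score used for neutral orientation."""
    if not math.isclose(sum(alphas_prime), 1.0):
        raise ValueError("weights must sum to 1.0")
    a_uni, a_lm = alphas_prime
    return _weighted_log(a_uni, p_block) + _weighted_log(a_lm, p_lm)


def sequence_log_score(block_log_scores: Iterable[float]) -> float:
    """Log of the product over blocks: the sum of per-block log scores."""
    return sum(block_log_scores)


# Invented probabilities; the weights are the settings listed in Table 1.
s = oriented_log_score(0.01, 0.8, 0.1, (0.66, 0.27, 0.07))
d = default_log_score(0.01, 0.1, (0.77, 0.23))
print(s, d, sequence_log_score([s, d]))
```

Working with logarithms avoids numerical underflow for long block sequences; a block whose orientation probability is zero receives a score of negative infinity in this sketch.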

Straightforward normalization over all successor blocks in Eq. 2 and in Eq. 3 is not feasible: there are tens of millions of possible successor blocks $b$. In future work, normalization over a restricted successor set, e.g. for a given source input sentence, all blocks $b$ that match this sentence, might be useful for both training and decoding. The segmentation model in Eq. 1 naturally prefers translations that make use of a smaller number of blocks, which leads to a smaller number of factors in Eq. 1. Using fewer 'bigger' blocks to carry out the translation generally seems to improve translation performance. Since normalization does not influence the number of blocks used to carry out the translation, it might be less important for our segmentation model.

We use a DP-based beam search procedure similar to the one presented in (Tillmann and Xia, 2003). We maximize over all block segmentations with orientation $(b_1^n, o_1^n)$ for which the source phrases yield a segmentation of the input sentence. Swapping involves only blocks $(b, b')$ for which $N_L(b) > 3$ for the successor block $b$, e.g. the blocks $b_2$ and $b_5$ in Fig. 1. We tried several thresholds for $N_L(b)$, and performance is reduced significantly only if $N_L(b) > 30$. No other parameters are used to control the block swapping. In particular the orientation $o'$ of the predecessor block $b'$ is ignored: in future work, we might take into account that a certain predecessor block $b'$ typically precedes other blocks.

Figure 3: During decoding, a mostly monotone block sequence with ($o_i = R$) orientation is generated as shown in the left picture. In the right picture, block swapping generates block $b_3$ to the left of block $b_2$. The blocks $b_2$ and $b_4$ do not have a left or right adjacent predecessor.
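The count threshold that restricts swapping can be expressed as a simple predicate. The sketch below is only an illustration: the function name, the dictionary-based count table and the block identifiers are hypothetical, while the count of 52 for block $b_2$ and the thresholds 3 and 30 are the values quoted in the text.

```python
from typing import Hashable, Mapping


def may_swap(successor: Hashable, n_left: Mapping[Hashable, int],
             threshold: int = 3) -> bool:
    """A successor block may be swapped only if its left count exceeds the threshold."""
    return n_left.get(successor, 0) > threshold


# Left orientation counts; "b2" uses the count reported for Fig. 1,
# "rare" is an invented block with too few left-orientation events.
left_counts = {"b2": 52, "rare": 2}

print(may_swap("b2", left_counts))                 # True
print(may_swap("rare", left_counts))               # False
print(may_swap("unseen", left_counts))             # False
print(may_swap("b2", left_counts, threshold=30))   # True
```

Blocks that were never observed with left orientation are treated as having a count of zero and are therefore never swapped in this sketch.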

3  Experimental  Results 

The translation system is tested on an Arabic-to-English translation task. The training data comes from the UN news sources: 87.5 million Arabic and 97.1 million English words. The training data is sentence-aligned, yielding 3.3 million training sentence pairs. The Arabic data is romanized; some punctuation tokenization and some number classing are carried out on the English and the Arabic training data. As devtest set, we use testing data provided by LDC, which consists of 1 043 sentences with 25 889 Arabic words with 4 reference translations. As a blind test set, we use the MT 03 Arabic-English DARPA evaluation test set consisting of 663 sentences with 16 278 Arabic words.

Three systems are evaluated in our experiments: S0 is the baseline block unigram model without re-ordering. Here, monotone block alignments are generated: the blocks $b_i$ have only left predecessors (no blocks are swapped). This is the model presented in (Tillmann and Xia, 2003). For the S1 model, the sentence is translated mostly monotonically, and only neighbor blocks are allowed to be swapped (at most 1 block is skipped). The S1 & OR model allows for the same block swapping as the S1 model, but additionally uses the orientation component described in Section 2: the block swapping is controlled by the unigram orientation counts. The S0 and S1 models use the block bigram model in Eq. 3: all blocks $b$ are generated with neutral orientation ($o = N$), and only two components, the block unigram model $p(b_i)$ and the block bigram score $p(b_i \mid b_{i-1})$, are used.

Table 1: Effect of the orientation model on Arabic-English test data: LDC devtest set and DARPA MT 03 blind test set.

Test      Unigram Model  Setting ($\alpha_0/\alpha_1/\alpha_2$)  BLEUr4n4
Dev test  S1             .74/.26                                 0.344 ± 0.012
Dev test  S0             .77/.23                                 0.355 ± 0.013
Dev test  S1 & OR        .66/.27/.07                             0.368 ± 0.014
Test      S1             .74/.26                                 0.336 ± 0.017
Test      S0             .77/.23                                 0.339 ± 0.016
Test      S1 & OR        .66/.27/.07                             0.356 ± 0.017

Table 2: Arabic-English example blocks from the devtest set: the Arabic phrases are romanized. The example blocks were swapped in the development test set translations. The counts are obtained from the parallel training data.

Arabic-English blocks                      $N_L(b)$  $N_R(b)$
('exhibition' | 'mErD')                    97        32
('added' | 'wADAf')                        285       68
('said' | 'wqAl')                          872       801
('suggested' | 'AqtrH')                    356       729
('terrorist attacks' | 'hjmAt ArhAbyp')    14        27
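As a small worked example, the counts in Table 2 can be converted into left-orientation relative frequencies using the orientation model $p_b(o)$ from Section 1, alongside the plain left-to-right count ratio. The dictionary layout and the helper name `left_share` are assumptions of this sketch; the counts are copied from Table 2.

```python
# (English phrase, romanized Arabic phrase) -> (N_L, N_R), copied from Table 2.
TABLE2 = {
    ("exhibition", "mErD"): (97, 32),
    ("added", "wADAf"): (285, 68),
    ("said", "wqAl"): (872, 801),
    ("suggested", "AqtrH"): (356, 729),
    ("terrorist attacks", "hjmAt ArhAbyp"): (14, 27),
}


def left_share(n_left: int, n_right: int) -> float:
    """Relative frequency of left orientation: N_L / (N_L + N_R)."""
    return n_left / (n_left + n_right)


for (english, arabic), (n_left, n_right) in TABLE2.items():
    print(f"{english:18s} {arabic:15s} "
          f"p_b(L)={left_share(n_left, n_right):.3f}  "
          f"N_L/N_R={n_left / n_right:.2f}")
```

For the first block the count ratio is about 3, i.e. the block was seen roughly three times more often with left than with right orientation.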

Experimental results are reported in Table 1: three BLEU results are presented for both devtest set and blind test set. Two scaling parameters are set on the devtest set and copied for use on the blind test set. The second column shows the model name, the third column presents the optimal weighting as obtained from the devtest set by carrying out an exhaustive grid search. The fourth column shows BLEU results together with confidence intervals (here, the word casing is ignored). The block swapping model S1 & OR obtains a statistically significant improvement over the baseline S0 model. Interestingly, the swapping model S1 without orientation performs worse than the baseline S0 model: the word-based trigram language model alone is too weak to control the block swapping, i.e. the model is too unrestrictive to handle the block swapping reliably. Additionally, Table 2 presents devtest set example blocks that have actually been swapped. The training data is unsegmented, as can be seen from the first two blocks. The block in the first line has been seen 3 times more often with left than with right orientation. Blocks for which the ratio $r = N_L(b) / N_R(b)$ is bigger than 0.25 are likely candidates for swapping in our Arabic-English experiments. The ratio $r$ itself is not currently used in the orientation model. The orientation model mostly affects blocks where the Arabic and English words are verbs or nouns. As shown in Fig. 1, the orientation model uses the orientation probability $p_{b_2}(L)$ for the noun block $b_2$, and only the default model for the adjective block $b_1$. Although the noun block might occur by itself without adjective, the swapping is not controlled by the occurrence of the adjective block $b_1$ (which does not have adjacent predecessors). We rather model the fact that a noun block $b$ is typically preceded by some block $b'$. This situation seems typical for the block swapping that occurs on the evaluation test set.
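The exhaustive grid search over the scaling parameters mentioned above could look roughly as follows. This is a sketch under stated assumptions: the step size, the function names and the `evaluate` callable are hypothetical, and the toy objective merely stands in for decoding the devtest set and computing BLEU, which is not reproduced here.

```python
from typing import Callable, Iterator, Tuple

Weights = Tuple[float, float, float]


def simplex_grid(steps: int) -> Iterator[Weights]:
    """Enumerate all weight triples on a regular grid that sum to 1.0."""
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            k = steps - i - j  # the third weight is fixed by the constraint
            yield (i / steps, j / steps, k / steps)


def grid_search(evaluate: Callable[[Weights], float], steps: int = 100) -> Weights:
    """Return the grid point with the highest value of the objective."""
    return max(simplex_grid(steps), key=evaluate)


# Toy objective, peaked at an arbitrary point; a real run would instead
# translate the devtest set with the given weights and return its BLEU score.
target = (0.66, 0.27, 0.07)


def toy_objective(weights: Weights) -> float:
    return -sum((w - t) ** 2 for w, t in zip(weights, target))


print(grid_search(toy_objective, steps=100))
```

Because the weights are constrained to sum to 1.0, only two of them vary freely, which is why the inner loop fixes the third weight.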